Abstract

We  describe  a  system  for  interactive,  real  time,  multi-camera  video  capture  and  display.  The  system 
uses  largely  general  purpose,  programmable  hardware,  and  as  a  result  is  flexible  and  expandable.  The 
computational framework and data storage are provided by the iWarp parallel computer, which we
describe  in  light  of  the  performance  requirements  of  real  time  video.  Video  display  is  accomplished 
using  a  High  Performance  Parallel  Interface  network  to  write  to  a  high  resolution  frame  buffer.  Video 
can  be  displayed  as  it  is  captured,  with  only  a  single  frame  latency.  We  provide  interactivity  with  a 
VCR-like  graphical  interface  running  on  a  host  workstation,  which  in  turn  controls  the  operation  of  the 
capture  system.  As  a  whole,  this  system  allows  a  user  to  interactively  monitor,  capture,  and  replay  video 
with the ease of use of a VCR, yet with flexibility and performance that are unavailable in all but the most
expensive digital VCRs. We describe the implementation in detail, and discuss possible future enhancements.
1. Introduction

For  the  purposes  of  a  fast,  accurate  multi-baseline  stereo  vision  system  currently  under  development  [1],  a 
video  capture  and  display  system  was  required  to  meet  the  following  specifications: 

•  multiple  cameras 

•  high  frame  rate  (30  Hz) 

•  high  resolution  and  image  quality 

•  interactive,  flexible  user  interface 

In particular, in order to provide an interactive capture process, it is necessary to include video display capabilities.
Ideally, the video would be displayed during capture, providing immediate feedback to the user.

Our  system  meets  all  of  these  goals,  and  provides  considerable  flexibility  for  future  expansion.  In  the  current 
implementation  it  supports  four  synchronized  cameras  sampling  512x480  8-bit  grayscale  images  at  30  Hz.  The 
foundation  of  this  system  is  an  iWarp  parallel  computer  [2]  [3],  which  manages  the  overall  data  flow.  Video 
input  to  the  iWarp  is  performed  by  locally  developed  hardware.  The  video  data  is  stored  in  the  iWarp’s  local 
memory,  and  simultaneously  sent  via  a  High  Performance  Parallel  Interface  (HiPPI)  network  to  a  frame  buffer, 
where all four images are displayed at full resolution in real time. The user directs this process with an X Windows
application running on a workstation.
2. iWarp Architecture


The iWarp computer we are using has 64 cells in an 8x8 array. Each cell consists of a processor, its local memory,
and support hardware. The processor itself consists of a computation agent and a communication agent:
the computation agent is under program control, while the communication agent transparently supports the
communication needs of the processor. Each cell is connected to each of its four nearest neighbors by a
full-duplex, 40 MB/s physical bus; the boundaries of the array are connected together, resulting in a torus.

The iWarp executes a single array program at a time. An array program consists of a set of programs which
execute  on  individual  cells,  a  mapping  of  programs  to  cells,  and  a  set  of  logical  communication  pathways  to  be 
established  between  processors.  Before  an  array  program  is  executed,  the  entire  array  is  reset  and  a  resident 
monitor,  the  iWarp  Run  Time  System  (iW/RTS),  is  loaded  onto  every  cell. 

Each  cell  has  512kB  of  fast  static  RAM,  which  is  used  for  the  iW/RTS,  user  program,  and  data.  The  cells  in 
rows  0-3  also  have  16MB  each  of  slower  dynamic  RAM,  which  in  our  case  is  used  for  video  data  storage.  With 
a  combined  total  of  512MB  of  dynamic  RAM,  these  cells  can  store  roughly  17  seconds  of  four  camera  full 
frame  rate  video. 

The  iWarp  provides  hardware  expandability  via  two  mechanisms.  First,  special  purpose  auxiliary  cells  can  be 
connected  to  the  array.  These  cells  contain  standard  iWarp  processors  and  communicate  with  the  rest  of  the 
array  normally;  however,  they  also  perform  a  dedicated  hardware  function.  One  such  cell,  the  Sun  Interface 
Board  (SIB),  provides  mechanisms  whereby  a  host  workstation  can  control  the  iWarp  array  and  provides  some 
I/O capability to the array program. Another pair of auxiliary cells, the HiPPI Interface Boards (HIBs), provide
extremely high bandwidth, full duplex communications to other devices on a High Performance Parallel Interface
(HiPPI) network.

While  auxiliary  cells  offer  tremendous  flexibility,  the  design  effort  is  considerable.  Alternatively,  each  general 
purpose  iWarp  cell  has  an  external  memory  bus  to  which  a  memory-mapped  I/O  device  can  be  connected.  The 
iWarp's 10 MB/s synchronous memory access rate enables high bandwidth I/O to be performed via this connector,
as long as the control requirements are relatively simple. The video interface, described below, takes
advantage  of  this  external  memory  bus. 

Figure  1  shows  the  configuration  of  the  iWarp  array  used  in  our  research. 


2.1.  Video  Interface 


We  use  four  black  and  white,  NTSC  interlaced  video  cameras.  The  NTSC  standard  specifies  525  lines  per 
frame,  divided  into  two  interlaced  fields.  Each  field  contains  roughly  240  lines  of  useful  information,  with  the 
remainder  of  the  field  devoted  to  the  vertical  retrace  interval. 

With 525 lines per frame and 30 frames per second, each scanline is roughly 63.5 microseconds (µs) long.
Horizontal retrace and blanking take up some of that interval, leaving a useful video portion of about 51.2 µs.
Using a 10 MHz sampling rate provides 512 pixels of horizontal resolution. The result is a 512x480 image for
each frame, consisting of two fields which are captured 1/60th of a second apart. The overall data rate is
slightly  under  30  MB/s;  however,  data  is  captured  in  bursts,  with  a  burst  rate  of  exactly  40  MB/s,  and  with 
short  pauses  between  lines  and  longer  pauses  between  fields. 
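The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant names are ours, and it assumes 1 MB means 10^6 bytes, which the text does not state.

```python
# Timing and data-rate arithmetic for the sampling scheme described above.
LINES_PER_FRAME = 525
FRAMES_PER_SECOND = 30            # nominal rate used in the text
SAMPLE_RATE_HZ = 10_000_000       # 10 MHz sampling clock
PIXELS_PER_LINE = 512
LINES_PER_IMAGE = 480
BYTES_PER_SAMPLE = 4              # one 8-bit pixel from each of four cameras

scanline_us = 1e6 / (LINES_PER_FRAME * FRAMES_PER_SECOND)
active_us = PIXELS_PER_LINE / SAMPLE_RATE_HZ * 1e6
burst_bytes_per_s = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
average_bytes_per_s = (PIXELS_PER_LINE * LINES_PER_IMAGE
                       * BYTES_PER_SAMPLE * FRAMES_PER_SECOND)

print(f"scanline period : {scanline_us:.1f} us")
print(f"active video    : {active_us:.1f} us")
print(f"burst rate      : {burst_bytes_per_s / 1e6:.1f} MB/s")
print(f"average rate    : {average_bytes_per_s / 1e6:.1f} MB/s")
```

Under these assumptions the average rate comes to 29,491,200 bytes per second, which matches the "slightly under 30 MB/s" quoted above.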

The actual video sampling is performed by specialized hardware [4] which performs simultaneous A/D conversion
of four synchronized video signals. Each A/D converter produces an 8-bit intensity for each pixel. The
four  pixels  are  concatenated  into  a  single  32-bit  word.  This  word  is  read  from  the  external  memory  bus  of  a 
general  purpose  iWarp  cell,  the  capture  cell. 
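As an illustration of the word format, the following sketch packs four 8-bit intensities into one 32-bit word and recovers them again. The byte order (camera 0 in the most significant byte) and the function names are assumptions made for this example; the text does not specify how the capture hardware orders the four pixels.

```python
def pack_pixels(p0, p1, p2, p3):
    """Concatenate four 8-bit intensities into one 32-bit word.

    Assumption: camera 0 occupies the most significant byte.
    """
    for p in (p0, p1, p2, p3):
        if not 0 <= p <= 0xFF:
            raise ValueError("pixel value out of 8-bit range")
    return (p0 << 24) | (p1 << 16) | (p2 << 8) | p3


def unpack_pixels(word):
    """Split a 32-bit word back into four 8-bit intensities."""
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


word = pack_pixels(0x12, 0x34, 0x56, 0x78)
pixels = unpack_pixels(word)
```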
Figure 1. Hardware overview

The  design  of  the  capture  hardware  was  considerably  simplified  by  not  including  a  buffer  between  the  A/D 
converter and the iWarp interface. However, the lack of buffering makes the task of the capture software particularly
difficult. Since we are using a 10 MHz sampling rate, a sample must be acquired on alternate cycles of
the  20  MHz  system  clock.  Since  the  memory  read  instruction  takes  two  clock  cycles,  the  capture  cell  must  read 
a  sample  from  the  video  capture  hardware  and  transmit  it  to  another  cell  in  the  array  in  every  instruction.  Any 
delay,  even  a  single  clock  cycle,  causes  unacceptable  error  in  the  resulting  image. 


The  fact  that  this  is  possible,  and  indeed  practical,  is  due  to  several  notable  features  of  the  iWarp  architecture. 
A  brief  summary  of  these  features  follows. 
3. iWarp Features

3.1.  Systolic  communication  model 

In  the  traditional  message  passing  model,  messages  are  accumulated  in  memory  and  transmitted  as  a  unit  to  a 
destination  cell.  In  our  application,  however,  the  overhead  imposed  by  memory  access  and  message  processing 
would  make  real-time  performance  completely  impossible.  At  the  very  least,  storing  each  value  in  a  memory 
buffer  after  it  is  sampled  would  require  a  second  memory  access  in  the  inner  loop,  reducing  the  frame  rate  to 
15  Hz.  Even  greater  performance  loss  would  be  likely  when  the  accumulated  message  is  transmitted.  However, 
with  systolic  communications,  single  words  are  transmitted  and  passed  independently  through  the  array;  in 
other  words,  the  message  grain  size  can  be  as  small  as  a  single  word.  This  obviates  the  need  for  buffering 
before  transmission. 

An  iWarp  program  uses  the  systolic  capability  of  the  iWarp  via  special  registers  called  gates,  which  serve  as 
endpoints  for  communication.  When  the  processor  writes  a  value  to  a  gate,  it  is  automatically  transmitted  to  a 
predefined  destination;  when  a  processor  reads  from  a  gate,  the  first  available  data  from  a  predefined  source  is 
provided. The communications agent has short queues on each gate, transparent to the user, which provide limited
buffering. If the receive queue is empty when the processor attempts to read from a gate, the instruction
blocks until a word is received. A write to a gate may also block in a similar manner. Since gates are represented
as registers, no special instructions are necessary to send and receive data, and indeed, operands or
results of an instruction can be received or transmitted via gates in exactly the same manner as an ordinary
register access.
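The gate semantics can be pictured with a toy software model. This is a sketch only, not the iWarp mechanism itself: the `Gate` class and its queue depth of 4 are hypothetical (the text says only that the queues are short), and where a real gate access would stall the instruction, the model raises an exception so that the example cannot hang.

```python
import queue


class Gate:
    """Toy model of a gate: a short FIFO between a writer and a reader.

    The depth is a hypothetical value chosen for illustration.
    """

    def __init__(self, depth=4):
        self._fifo = queue.Queue(maxsize=depth)

    def would_block_on_write(self):
        return self._fifo.full()

    def would_block_on_read(self):
        return self._fifo.empty()

    def write(self, word):
        # A real gate write would stall here; the model raises queue.Full.
        self._fifo.put_nowait(word)

    def read(self):
        # A real gate read would stall here; the model raises queue.Empty.
        return self._fifo.get_nowait()


gate = Gate(depth=2)
gate.write(0xAAAA)
gate.write(0xBBBB)
full_after_two_writes = gate.would_block_on_write()
first, second = gate.read(), gate.read()
empty_after_two_reads = gate.would_block_on_read()
```

Words leave the gate in the order in which they were written, which is the property the capture and storage cells rely on.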


3.2.  Efficient  data  routing 


The  iWarp  provides  hardware-assisted  data  routing  with  logical  pathways  and  logical  ports.  A  logical  pathway 
is  a  data  transport  mechanism  with  a  predefined  source  and  destination  cell;  each  end  of  the  pathway  is 
assigned  a  logical  port,  which  the  program  uses  to  send  or  receive  data  over  the  pathway.  Data  passes  through 
the cells along the route with negligible delay and no performance impact upon the intermediate cells. Furthermore,
several logical pathways can pass over a single physical bus, without added overhead, and without contention
until the full 40 MB/s bandwidth of the physical bus is reached. These mechanisms allow the user to
largely  ignore  the  physical  topology  of  the  array  and  the  details  of  data  routing,  multiplexing,  and  transport. 


3.3.  Zero-overhead  loops 

The  iWarp  processor  has  dedicated  support  for  automatic  loop  iteration  and  branching.  Using  a  special  loop 
count register and a dedicated bit field in the instruction set, a loop need not contain any instructions to decrement
the loop counter or perform the branch; that is all performed by dedicated hardware, in most cases without
any time penalty or overhead.

Most  of  the  work  performed  by  the  video  capture  program  is  done  in  very  small  loops,  usually  only  two  or 
three  instructions,  which  must  be  executed  continuously.  The  additional  overhead  imposed  by  performing 
loops  in  software  would  make  the  software  design  task  more  difficult,  if  not  impossible.  For  instance,  although 
loops  could  be  unrolled  to  reduce  overhead,  they  must  remain  small  enough  to  fit  within  the  processor’s 
instruction  cache. 
3.4. Real time performance

The  iWarp  has  a  very  rudimentary  operating  system  and  no  support  for  multitasking.  The  few  interrupt-based 
services  provided  by  the  iW/RTS  can  easily  be  disabled.  When  this  is  done,  the  execution  of  a  non-I/O-bound 
array  program  becomes  largely  deterministic.  (Instruction  caching  limits  the  predictability  to  some  extent.) 
Furthermore,  each  iWarp  processor  contains  a  hardware  timer  which  decrements  every  8  clock  cycles.  As  a 
result, the behavior and timing of an array program can be made predictable, reliable, repeatable, and measurable
down to a sub-microsecond level. This is necessary to our application, since an unpredictable delay on
any  cell  which  processes  the  unbuffered  video  data  stream  could  cause  loss  of  data  at  the  capture  cell. 
4. Data Flow Part I: Capture and Storage

The  inner  loop  which  captures  a  single  line  of  video  data  and  sends  it  continuously  to  another  cell  could  be  as 
simple as the following¹:

    loop 512                                 ; executes once
    endloop load (video_ADC_address), gate0  ; executes 512 times

However,  this  causes  two  problems: 

a.  A  four-byte  word  is  being  transmitted  every  other  clock  cycle  within  the  video  scanline,  for  a  burst 
transfer rate of 40 MB/s. This is precisely the maximum theoretical bandwidth of the physical buses
used to connect cells. However, a bug in the handshaking mechanism between cells results in
unacceptable  jitter  when  this  full  bandwidth  is  used  entirely  with  data  being  read  directly  from 
memory  (as  opposed  to  a  combination  of  data  passed  from  other  cells  and  data  from  memory). 

b.  Other  cells  must  process  the  data  stream,  and  if  one  sample  arrives  every  two  cycles,  not  much  time 
is  left  for  processing. 

In  order  to  reduce  the  peak  transfer  rate  from  40  MB/s,  we  split  the  video  data  stream,  by  transmitting  alternate 
samples  over  two  pathways  to  two  identical  sets  of  storage  cells.  As  a  result,  each  video  field  is  stored  as  two 
separate half-fields, one with the even-numbered pixels from each scanline, and the other with the odd-numbered
pixels from each scanline. The inner loop which actually executes on the capture cell is as follows:


    loop 256                                 ; executes once
    load (video_ADC_address), gate0          ; executes 256 times
    endloop load (video_ADC_address), gate1  ; executes 256 times
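The effect of this loop on one scanline can be modelled in a few lines. The sketch below is a software illustration, not the capture code: `split_scanline` and the list-based pathways are hypothetical stand-ins for the two gates, and the rate calculation assumes the 20 MHz clock and 4-byte words described earlier.

```python
def split_scanline(samples):
    """Send alternate samples to two pathways, as the 256-iteration loop does."""
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    to_gate0 = samples[0::2]   # even-numbered pixels
    to_gate1 = samples[1::2]   # odd-numbered pixels
    return to_gate0, to_gate1


scanline = list(range(512))            # stand-in for 512 sampled words
even_half, odd_half = split_scanline(scanline)

# Each pathway now carries one 4-byte word every four cycles of the 20 MHz clock.
CLOCK_HZ = 20_000_000
per_pathway_burst = CLOCK_HZ // 4 * 4  # bytes per second on each pathway
```

Each pathway therefore peaks at 20 MB/s, half of the 40 MB/s limit of a physical bus.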

The storage cells are cells 0 through 31, each of which has 16 MB of dynamic RAM. These cells make up
the top half of the array, as shown in Figure 2. Of these 32 cells, the left 16 store the half-fields containing
the even-numbered pixels, and the right 16 store the half-fields containing the odd-numbered pixels. The storage
cells in each half are connected together in a serial, unidirectional fashion. The capture cell is connected to
the  first  storage  cell  in  each  chain,  and  the  last  storage  cell  is  connected  to  the  display  processing  cells,  which 
are  discussed  in  Section  5.  In  all,  these  cells  can  store  2048  half-fields,  which  make  up  512  frames,  or  slightly 
over  17  seconds  of  30  Hz  video. 
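The capacity figures can be cross-checked as follows. This is a rough consistency check rather than a description of the actual memory layout: it assumes the 2048 half-fields are spread evenly over the 32 storage cells and that 16 MB means 16 x 2^20 bytes, neither of which is stated in the text.

```python
STORAGE_CELLS = 32
TOTAL_HALF_FIELDS = 2048          # figure quoted in the text
WORDS_PER_HALF_LINE = 256         # every other sample of a 512-sample line
LINES_PER_FIELD = 240
BYTES_PER_WORD = 4
DRAM_PER_CELL = 16 * 2**20        # assumption: 16 MB = 16 * 2**20 bytes

half_field_bytes = WORDS_PER_HALF_LINE * LINES_PER_FIELD * BYTES_PER_WORD
half_fields_per_cell = TOTAL_HALF_FIELDS // STORAGE_CELLS   # assumes an even spread
bytes_per_cell = half_fields_per_cell * half_field_bytes
frames = TOTAL_HALF_FIELDS // 4   # two fields per frame, two half-fields per field
seconds = frames / 30

print(half_fields_per_cell, bytes_per_cell, frames, round(seconds, 2))
```

Under these assumptions each cell holds 64 half-fields, which occupy 15 x 2^20 bytes of its 16 MB, and the 512 frames last about 17.07 seconds.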

When  data  is  being  captured,  every  storage  cell  passes  the  incoming  data,  uninterrupted,  to  the  next  cell  in  the 
chain.  One  of  these  cells  also  copies  the  data  into  its  local  memory;  the  control  mechanism  which  determines 
which  cell  should  store  a  given  frame  is  discussed  in  Section  6. 

This  design  lends  itself  well  to  playback  after  capture  as  well  as  monitoring  during  capture.  In  order  to  display 
a  particular  frame,  the  storage  cell  containing  that  frame  transmits  the  frame  data,  and  every  subsequent  cell 
passes  it  on.  The  data  reaches  the  display  processing  cells  exactly  as  if  it  had  just  been  captured. 

4.1.  Image  Display 


We  considered  three  possible  methods  for  displaying  video  during  capture: 

a. On the host workstation. It would be possible to send image data from the iWarp, via the SIB, to the
host, and then display it on the host’s monitor. However, the performance of this approach would be
completely unacceptable. The SIB-to-host communications have low bandwidth and very high,
unpredictable latency imposed by the Unix operating system. Only very small, “thumbnail” images
could be transmitted, and only at a very low frame rate (a few Hertz); even then, uniform frame
rates could not be guaranteed.

b. On television monitors. In addition to the video A/D converters described above, we have D/A converters
which operate similarly, and provide an NTSC signal suitable for display on high quality
television monitors. Unfortunately, the D/A converters are not as reliable as the A/D converters,
and exhibit problems with synchronization and stability. Furthermore, using separate monitors for
each camera is inelegant and scales badly.

c. On a dedicated framebuffer via a high speed network. This is the approach we chose to implement.

Figure 2. Capture and Storage

Footnote 1. The assembly language syntax used here is modified slightly for clarity.

High  Performance  Parallel  Interface  (HiPPI)  is  a  connection  oriented,  switched,  very  high  bandwidth  network 
protocol [5][6]. It provides guaranteed 100 MB/s connections between devices, using crossbar switches to create
small networks [7]. A recent collaboration between Carnegie Mellon University and the Network Systems
Corporation resulted in the development of HiPPI transmit and receive hardware for the iWarp computer
[9][10]. Like most HiPPI interfaces, the iWarp HiPPI Interface Boards (HIBs) do not achieve the full theoretical
data rate of HiPPI. The transmit HIB (X-HIB) achieves a reliable 42.5 MB/s transfer rate.


We  also  have  a  HiPPI  frame  buffer  developed  by  Network  Systems  Corporation.  The  frame  buffer  receives 
data  in  a  simple  protocol  (Frame  Buffer  Protocol)  layered  over  the  HiPPI  framing  protocol  (“raw”  HiPPI).  It 
can  be  configured  for  any  window  size  up  to  1024x1024,  and  can  display  8-bit  grayscale,  8-bit  indexed  color, 
or  24-bit  color  images  using  a  number  of  pixel  formats.  The  frame  rate  is  limited  only  by  the  HiPPI  bandwidth 
and the size of the window being updated. Therefore, using the X-HIB and the HiPPI framebuffer, we can easily
display four 512x480 images tiled in a 1024x960 window at 30 Hz.
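The headroom behind this claim can be estimated directly. The sketch assumes one byte per pixel (8-bit grayscale), takes 42.5 MB/s to mean 42.5 x 10^6 bytes per second, and ignores any Frame Buffer Protocol or HiPPI framing overhead, so the result is an upper bound rather than a measured figure.

```python
WINDOW_WIDTH, WINDOW_HEIGHT = 1024, 960
FRAME_RATE_HZ = 30
XHIB_BYTES_PER_S = 42_500_000      # assumption: 1 MB = 10**6 bytes

window_bytes = WINDOW_WIDTH * WINDOW_HEIGHT       # one byte per pixel
required_bytes_per_s = window_bytes * FRAME_RATE_HZ
utilization = required_bytes_per_s / XHIB_BYTES_PER_S
max_frame_rate = XHIB_BYTES_PER_S / window_bytes  # ignoring protocol overhead

print(f"required {required_bytes_per_s / 1e6:.1f} MB/s, "
      f"utilization {utilization:.0%}, ceiling {max_frame_rate:.1f} Hz")
```

On these assumptions the display stream uses roughly 69% of the X-HIB transfer rate, leaving a ceiling of about 43 Hz for a full 1024x960 window.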

It  should  be  noted  that,  in  order  for  an  expanded  implementation  to  display  more  than  four  images,  we  would 
need  to  subsample  the  images  or  allow  different  images  to  be  multiplexed  onto  the  display  at  different  times, 
because of the fixed resolution of the monitor. Either of these possibilities could be implemented quite
straightforwardly and would not greatly hamper the usability of the system.

HiPPI communications on the iWarp are handled by a software suite running on the HiPPI Interface Boards,
called the HiPPI Streams Interface (HSI) [8]. The HSI serves to insulate the array program from network-specific
or protocol-specific details. The mechanism of this abstraction is particularly powerful: the iWarp’s logical
pathways. When a connection is established, the user program tells the HSI over which pathways it will be
transmitting  or  receiving  data,  and  how  that  data  should  be  packetized;  from  that  point  on,  as  far  as  the  user 
program  is  concerned,  network  communications  are  no  different  from  ordinary  inter-cell  communications.  The 
HSI  has  built-in  support  for  the  protocol  used  by  the  framebuffer. 
5. Data Flow Part II: Processing for Display

A  significant  amount  of  processing  is  required  before  the  video  data  can  be  sent  to  the  HSI  and  the  framebuffer. 
The  data  arrives  with  three  levels  of  interleaving,  and  must  be  sent  to  the  frame  buffer  as  a  single  1024x960 
window.  The  necessary  processing  tasks  are  as  follows: 

a.  Pixel  interleaving.  Data  arrives  on  two  separate  pathways,  one  with  even  pixels,  the  other  with  odd 
pixels. 

b.  Image  demultiplexing.  Each  word  of  incoming  data  consists  of  four  8-bit  pixel  values,  one  from 
each  camera.  These  must  be  separated  so  that  each  image  can  be  transmitted  independently. 

c.  Line  interlacing.  The  even  lines  of  an  image  arrive  first,  followed  by  the  odd  lines. 

d. Image tiling. The resulting four images must be combined for transmission to the frame buffer.

These  tasks  could  conceivably  be  performed  in  many  different  ways.  However,  the  requirement  that  they  be 
implemented  using  the  31  available  iWarp  cells,  each  of  which  has  only  512kB  of  memory,  yet  provide  30  Hz 
frame  rate  with  as  little  latency  as  possible,  limits  the  possibilities  considerably.  For  example,  during  a  video 
scanline, two words of data arrive at the display processing cells every four clock cycles; these must be rearranged
and then sent at an average rate of 30 MB/s over a single pathway. Even if the task were split evenly
across  all  available  processors,  without  any  blocking  or  overhead,  only  21  clock  cycles  would  be  available  to 
process  each  byte  of  data  sent  to  the  frame  buffer. 
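The 21-cycle budget follows from the clock rate, the number of cells, and the output rate. A minimal check, assuming the 20 MHz system clock mentioned earlier and 1 MB = 10^6 bytes:

```python
CLOCK_HZ = 20_000_000
DISPLAY_CELLS = 31
ROUNDED_OUTPUT_RATE = 30_000_000            # the 30 MB/s figure used in the text
EXACT_OUTPUT_RATE = 1024 * 960 * 30         # bytes per second for a 1024x960 window

total_cycles_per_s = DISPLAY_CELLS * CLOCK_HZ
cycles_per_byte_rounded = total_cycles_per_s / ROUNDED_OUTPUT_RATE
cycles_per_byte_exact = total_cycles_per_s / EXACT_OUTPUT_RATE

print(f"{cycles_per_byte_rounded:.2f} cycles/byte at 30 MB/s")
print(f"{cycles_per_byte_exact:.2f} cycles/byte at the exact window rate")
```

Both variants round to 21 cycles per byte, and that is an upper bound since it assumes perfect load balance with no blocking.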

Our  approach  involves  two  stages,  which  use  different  operational  paradigms.  The  first  stage,  which  performs 
pixel  de-interleaving  and  channel  demultiplexing,  operates  on  the  data  as  a  continuous  pixel  stream,  ignoring 
scanline  and  frame  boundaries.  Consecutive  cells  perform  simple  operations  on  the  data  stream  and  then  pass 
data  on  to  other  cells  for  further  processing.  The  second  stage,  which  performs  line  de-interlacing  and  image 
tiling,  uses  double-buffering  to  store  complete  frames  and  then  send  them,  line  by  line,  in  a  rearranged  order. 

Stage  1  is  best  understood  by  considering  the  first  four  pixels  of  a  frame  arriving  from  the  storage  cells.  Due  to 
pixel  interleaving,  the  first  and  third  pixels  arrive  on  one  pathway,  while  the  second  and  fourth  arrive  on 
another  pathway.  Due  to  pixel  multiplexing,  the  incoming  images  are  combined,  with  a  single  word  containing 
the  pixel  values  for  all  four  images.  The  output  of  stage  1  should  be  four  words  on  four  separate  pathways, 
each one of which contains the first four pixels of an image. Subsequent blocks of four pixels are processed
identically.  The  operations  performed  by  each  cell  to  achieve  this  goal  are  illustrated  in  Figure  3. 
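Functionally, stage 1 restores pixel order and then transposes a 4x4 block of bytes. The sketch below collapses that into a single function for clarity; in the real system the work is spread over several cells as Figure 3 shows. The function name, the tuple representation of a word, and the test values are all illustrative assumptions.

```python
def stage1_block(even_words, odd_words):
    """Rearrange one block of four pixels.

    even_words: the words for pixels 0 and 2 (one pathway)
    odd_words:  the words for pixels 1 and 3 (the other pathway)
    Each word is a 4-tuple of bytes, one per camera.
    Returns four 4-tuples, one per camera, each holding pixels 0..3.
    """
    # Pixel de-interleaving: restore the original pixel order.
    ordered = [even_words[0], odd_words[0], even_words[1], odd_words[1]]
    # Channel demultiplexing: turn pixel-major words into camera-major words.
    return [tuple(word[cam] for word in ordered) for cam in range(4)]


def sample_word(pixel):
    """Test pattern: camera c, pixel p is encoded as 10*c + p."""
    return tuple(10 * cam + pixel for cam in range(4))


even_pathway = [sample_word(0), sample_word(2)]
odd_pathway = [sample_word(1), sample_word(3)]
per_image = stage1_block(even_pathway, odd_pathway)
```

With this test pattern, the word produced for camera 0 is (0, 1, 2, 3) and the word for camera 3 is (30, 31, 32, 33).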

Stage  2  is  somewhat  simpler,  consisting  of  only  two  basic  types  of  cells:  8  buffer  cells,  and  one  switcher  cell. 
The  8  buffer  cells  are  broken  into  two  sets  of  four,  which  are  used  to  double-buffer  the  four  images.  While  one 
set  is  buffering  data  from  the  capture  cell,  the  other  is  transmitting  data  to  the  switcher  cell;  at  the  end  of  the 
frame, the two sets of buffer cells trade roles. This process is illustrated in Figure 4 by physical switches; in
reality, the switching is performed by the combine cells (from stage 1) and the switcher cell, which bind alternating
sets of logical ports on alternating frames.

Line de-interlacing is performed trivially by the buffer cells; they read the interlaced video data directly into a
memory  buffer,  and  then  send  individual  scanlines  from  the  buffer  in  such  a  way  that  the  switcher  cell  receives 
non-interlaced  data. 
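The reordering performed by a buffer cell can be sketched as follows. This is an illustration under the assumption that a frame arrives as the complete even field followed by the complete odd field; the function name and the string-labelled scanlines are hypothetical.

```python
def deinterlace(even_field, odd_field):
    """Emit scanlines in display order from two buffered fields."""
    if len(even_field) != len(odd_field):
        raise ValueError("fields must contain the same number of lines")
    frame = []
    for even_line, odd_line in zip(even_field, odd_field):
        frame.append(even_line)
        frame.append(odd_line)
    return frame


# Lines arrive as the even field (0, 2, 4, ...) followed by the odd field.
even_field = [f"line {n}" for n in range(0, 480, 2)]
odd_field = [f"line {n}" for n in range(1, 480, 2)]
frame = deinterlace(even_field, odd_field)
```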

The  switcher  cell  is  responsible  for  tiling  the  images  and  sending  them  to  the  X-HIB.  The  upper  half  of  the 
window  consists  of  images  1  and  2  side-by-side,  which  is  accomplished  by  passing  individual  scanlines  of 
images 1 and 2 in alternation. The lower half of the window is filled with images 3 and 4 in an identical manner.
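The tiling order can be made concrete with a small model. This is a sketch, not the switcher cell's code: each image is represented as a list of 480 `bytes` scanlines of 512 pixels, and the `tile` function is a hypothetical stand-in for the alternation of scanlines described above.

```python
def tile(images):
    """Combine four images into one window, two across and two down."""
    image1, image2, image3, image4 = images
    window = []
    # Upper half: scanlines of images 1 and 2 sent in alternation.
    for left_line, right_line in zip(image1, image2):
        window.append(left_line + right_line)
    # Lower half: images 3 and 4 handled in the same way.
    for left_line, right_line in zip(image3, image4):
        window.append(left_line + right_line)
    return window


# Four flat test images, each filled with its own image number.
images = [[bytes([number]) * 512 for _ in range(480)] for number in (1, 2, 3, 4)]
window = tile(images)
```

The result has 960 rows of 1024 bytes, matching the 1024x960 window sent to the frame buffer.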
Figure 3. Display Processing - Stage 1. Panels: a) Data Flow; b) Layout; c) Key.

Figure 4. Display Processing - Stage 2. Panels include c) Key.
6. Control and User Interface

The videocassette recorder (VCR) is the most basic and widely used device for video capture. It is interactive
and  flexible,  yet  familiar  and  easy  to  use.  It  provides  most  of  the  controls  we  need  to  capture  and  store  video 
from  multiple  cameras.  Hence,  we  sought  to  emulate  the  functionality  and  user  interface  of  a  VCR  as  much  as 
possible. 

This  was  achieved  with  a  graphical  user  interface  (GUI)  running  on  the  host  workstation.  A  snapshot  of  the 
interface is shown in Figure 5. This interface was programmed using Tcl, an interpreted programming language,
and  Tk,  a  toolkit  under  Tcl  which  allows  rapid  prototyping  and  development  of  X  Windows  applications. 
Using a small library of C routines which we developed, the Tcl program starts the video capture array program
on the attached iWarp array and exchanges short messages with the array. All communication with the
array is performed via the Sun Interface Board, using a communications facility called imsg supported by the
iW/RTS. 


Figure  5.  GUI  for  video  capture  program. 


The GUI program sends messages to the SIB, and receives return messages, in what we call the external protocol.
The external protocol concerns actions by the user, and is restricted to a relatively coarse level of control.
The SIB then performs real-time, frame-by-frame control of the rest of the iWarp array, using the internal protocol
to communicate with the other cells.

The  external  protocol  is  used  to  transmit  user  actions  to  the  array,  and  to  update  the  state  of  the  GUI  (to  reflect, 
for instance, the number of frames available for recording). However, the imsg mechanism used by the external
protocol has a number of drawbacks which limit its utility. The imsg mechanism has a high, unpredictable message latency
due  in  large  part  to  the  Unix  operating  system  on  the  host,  so  it  is  impossible  for  the  host  to  exercise  real  time 
control  of  the  array.  Also,  imsg  sends  are  inherently  blocking  in  nature,  so  the  SIB  cannot  send  a  message  to 
the  host  and  continue  to  control  the  array  in  real  time.  This  means  that  certain  desirable  features  cannot  be 
implemented,  such  as  a  frame  counter  on  the  GUI  which  increments  during  recording  or  playback  to  reflect  the 
actual  current  frame  number. 

The  internal  protocol  consists  of  short  control  messages,  which  are  sent  over  a  dedicated  pathway  connecting 
every cell in the array in a closed loop. During the video retrace interval between frames, the SIB sends a message,
e.g. to capture a frame into memory at a specific location. Each cell copies and passes the message, until
it returns to the SIB. From the content of the message, each cell can determine what it needs to do for the next
frame. The message takes about 47 µs to pass through the entire array, leaving time to spare before the end of
the  1.25  ms  vertical  retrace  interval.  After  the  ensuing  frame  is  complete,  all  of  the  cells  switch  back  to  the 
message  pathway  and  wait  for  further  instructions. 
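The internal protocol can be pictured as a message circulating around a ring. The sketch below is only a model: the message fields (`op`, `target_cell`, `slot`), the dictionary representation of a cell, and the 16-cell chain are hypothetical, since the text does not give the message format. The timing figures are the ones quoted above.

```python
def circulate(message, cells):
    """Pass a control message around a closed loop; each cell keeps a copy."""
    for cell in cells:
        cell["pending"] = dict(message)   # copy the message, then pass it on
    return message                        # arrives back at the sender


def stores_next_frame(cell):
    """Each cell decides locally, from its copy, what to do next frame."""
    msg = cell["pending"]
    return msg["op"] == "capture" and msg["target_cell"] == cell["id"]


# One chain of 16 storage cells; the message fields are illustrative only.
cells = [{"id": i, "pending": None} for i in range(16)]
returned = circulate({"op": "capture", "target_cell": 5, "slot": 12}, cells)
storing = [cell["id"] for cell in cells if stores_next_frame(cell)]

LOOP_TIME_US = 47        # time for the message to traverse the array
RETRACE_US = 1250        # 1.25 ms vertical retrace interval
fraction_used = LOOP_TIME_US / RETRACE_US
```

In this model only cell 5 stores the next frame while every other cell simply forwards data, and the control loop consumes under 4% of the retrace interval.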
6.1. Performance


The  resulting  system  works  as  described,  and  provides  all  the  usual  capabilities  of  a  conventional  VCR  — 
play,  record,  fast  forward,  cue,  and  rewind  —  through  a  convenient  graphical  interface.  It  also  supports  many 
capabilities important in multi-baseline stereo research, such as variable capture rates and integration
with a computer system (allowing frame capture to be triggered under program control, and providing
accurate  per-image  timestamps,  allowing  the  video  data  to  be  related  to  data  from  other  devices). 

The  peak  transfer  rate  within  iWarp  is  40  MB/s.  The  average  transfer  rate  within  iWarp  and  over  HiPPI  is  30 
MB/s,  a  significant  fraction  of  the  peak  42.5  MB/s  achievable  with  the  current  hardware  interface.  We  have 
recently  expanded  the  system  to  support  four  video  capture  boards,  for  an  aggregate  peak  bandwidth  within 
iWarp  of  160  MB/s. 

One might think that a “HiPPI VCR” such as this is merely an expensive demonstration of the same capabilities
available cheaply from consumer electronics. But this is not true. Conventional VCRs do not support
frame-synchronized  capture  of  video  data  from  multiple  cameras,  do  not  allow  variable  capture  rates,  do  not 
record  data  reliably  in  digital  form,  and,  most  important  in  our  application,  are  not  tightly  integrated  into  a 
computer  system  that  supports  transfer  of  the  video  imagery  to  disk  for  processing,  or  transmission  to  other 
high  performance  computing  devices  for  further  processing.  Measured  against  a  system  that  provides  such 
capabilities,  our  system  is  quite  cost-effective  (iWarp’s  cost  was  roughly  $500K  in  1992). 
7. Future

There  are  a  number  of  paths  for  future  expansion  and  improvement  which  can  easily  be  added  to  our  existing 
architecture.  We  are  considering: 

1.  Additional  camera  support.  Video  interfaces  can  be  attached  to  up  to  8  cells,  allowing  simultaneous 
capture  from  32  synchronized  cameras.  Naturally,  there  are  trade-offs  involved,  since  the  storage 
and  computational  capabilities  of  the  iWarp  are  fixed.  Some  features  which  may  be  sacrificed  in 
order to support a larger number of cameras include frame rate, storage capacity, and display capability.

2. External storage over HiPPI. The Parallel Data Laboratory at Carnegie Mellon University is developing
a large (100+ GB) storage server which should provide high bandwidth disk storage over
HiPPI [11]. Much of the software developed for image display will be directly applicable to image
storage over HiPPI, including the user interface and video processing. Furthermore, since additional
processing could be performed in the iWarp cells currently devoted to storage, a greater number
of cameras could be supported at high frame rates. However, the limited throughput of the
HiPPI  adapter  and  storage  server  may  necessitate  data  compression,  possibly  with  additional 
impact  on  performance. 

3.  External  stereo  vision  calculation  over  HiPPI.  In  the  past,  stereo  vision  processing  was  performed 
on the iWarp. Since more powerful computers are now available, the iWarp could serve as a dedicated
video capture system and feed data over HiPPI to another computer (e.g. Intel Paragon)
which would do further processing. From the perspective of the iWarp, this task is very similar to
data storage on an external server, requiring only minor changes in software between the two applications.

Hence, this system has a clear path of future growth. Ultimately, it could lead to a flexible, modular, expandable
video capture and processing network based on HiPPI, using largely non-specialized facilities and hardware.